Mapping the HiSeq sequences to the genome

# 1. Extract the sequences


First, prepare the Fastq files, and run tagdust:

```
python make_extraction_scripts.py
bash -x script.sh
```
This creates symbolic links to the sequencing files `/sequencedata/Hiseq2000/160428_SN554_0305_AC91WGACXX/Unaligned/Project_LS2744_RNhi10541_Lane?/Sample_RNhi10541/RNhi10541_??????_L00?_R1_001.fastq.gz`, and runs `tagdust` on them:
```
/home/mdehoon/bin/tagdust -t8 -arch tagdust.arch -o RNhi10541_<barcode>_L00<lane> RNhi10541_<barcode>_L00<lane>_R1_001.fastq.gz
```
Remove the Fastq files containing the reads for which the barcode could not be identified:
```
rm RNhi10541_??????_L00?_un.fq
```

Create Fastq files by sample:
```
tail -n +2 ../MiSeq/multiplex.txt | grep Timecourse | cut -f 3,4,5 | while read -r name bc index
do
    for f in RNhi10541_"$index"_L00?_BC_"$bc".fq; do
        cat "$f" >> "$name".fq
    done
done
```
Note that the HiSeq libraries contain samples for the time courses only.
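
The shell loop above groups the per-lane `tagdust` output by sample. A minimal Python sketch of the same selection logic follows. It assumes that `multiplex.txt` is tab-delimited with one header line and that columns 3 to 5 hold the sample name, barcode, and index, as the `cut -f 3,4,5` call suggests. The function name `sample_patterns` is hypothetical.

```python
def sample_patterns(lines, keyword="Timecourse"):
    """Return {sample name: glob pattern of the tagdust files for that sample}."""
    patterns = {}
    for line in list(lines)[1:]:          # skip the header line, like `tail -n +2`
        if keyword not in line:           # keep matching rows, like `grep Timecourse`
            continue
        # Columns 3, 4, 5 (assumed layout): sample name, barcode, index
        name, barcode, index = line.rstrip("\n").split("\t")[2:5]
        patterns[name] = f"RNhi10541_{index}_L00?_BC_{barcode}.fq"
    return patterns
```

Each pattern can then be expanded with `glob.glob` and the matching files concatenated into `<name>.fq`.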

Remove the files created by `tagdust`:
```
rm RNhi10541_??????_L00?_BC_???.fq
```

Also remove the linked files:
```
rm RNhi10541_??????_L00?_R1_001.fastq.gz
```

# 2. Collect unique sequences

Create a Fasta file with the unique sequences in each library:
```
python make_unique_sequence_scripts.py
bash -x script.sh
```
This will run
```
python collect_unique_sequences.py <library>
```
for each library, and generate a Fasta file `<library>.fa` with the unique sequences in each library.
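
The contents of `collect_unique_sequences.py` are not shown here. As a rough sketch of the idea, the following code collapses the reads of one library into unique sequences and formats them as Fasta. It assumes four-line Fastq records; the function names, the naming scheme `<prefix>_<number>`, and the `count=` annotation are illustrative assumptions.

```python
from collections import Counter
from itertools import islice


def read_fastq_sequences(lines):
    """Yield the sequence line (2nd of every 4 lines) of each Fastq record."""
    for line in islice(lines, 1, None, 4):
        yield line.strip()


def unique_sequences(lines):
    """Count how often each distinct sequence occurs in a Fastq file."""
    return Counter(read_fastq_sequences(lines))


def format_fasta(counts, prefix):
    """Format the unique sequences as Fasta text (hypothetical naming scheme)."""
    records = []
    for number, (sequence, count) in enumerate(sorted(counts.items())):
        records.append(f">{prefix}_{number} count={count}\n{sequence}\n")
    return "".join(records)
```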
Merge these files to generate a single file `seqlist.fa` with the 14,070,337 unique sequences across all 18 libraries:
```
python merge_unique_sequences.py
```
Remove the intermediate files:
```
rm t??_r[1-3].fa
```
Create an index associating a unique sequence with each sequenced read:
```
python make_sequence_index_scripts.py
bash -x script.sh
```
This will run
```
python make_sequence_index.py <library>
```
for each library, and generate an index file `<library>.index.txt` for each.
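
A sketch of what such an index could look like, under stated assumptions: the read name is in the first column and the name of its unique sequence in the second (which is consistent with the later `sort -k 2`), and `seqlist.fa` stores each sequence on a single line. The function names are hypothetical, and `make_sequence_index.py` itself may work differently.

```python
def parse_fasta(lines):
    """Return {sequence: name} for a Fasta file with one sequence line per record."""
    names = {}
    name = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]
        elif line:
            names[line] = name
    return names


def make_index(fastq_lines, names):
    """Yield 'read name<TAB>unique sequence name' for each Fastq record."""
    # Group the Fastq lines into records of four lines each
    for header, sequence, _, _ in zip(*[iter(fastq_lines)] * 4):
        read = header[1:].split()[0]
        yield f"{read}\t{names[sequence.strip()]}"
```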
Remove the intermediate files:
```
rm script.sh
rm script_t??_r?.sh
rm script_t??_r?.stdout
rm script_t??_r?.stderr
```
Sort the index files by name of the unique sequences:
```
for f in t??_r?.index.txt; do
    sort -k 2 -o "$f" "$f"
done
```
Split the unique sequences into 100 files:
```
python split_sequences.py 100
```
This creates 100 files named `seqlist_<number>.fa`, with `<number>` ranging from 0 through 99. It also creates a file `skipped.fa` containing the sequences with internal N's, which were removed because BWA may choke on them.
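
The splitting step can be sketched as follows. This is an illustration only: the round-robin assignment to chunks and the function name `split_sequences` are assumptions, and "internal N" is taken to mean an N that is neither at the start nor at the end of the sequence.

```python
def split_sequences(records, n):
    """Distribute (name, sequence) records over n chunks.

    Sequences with an internal N are set aside instead of being assigned
    to a chunk. Returns (chunks, skipped).
    """
    chunks = [[] for _ in range(n)]
    skipped = []
    kept = 0
    for name, sequence in records:
        if "N" in sequence.strip("N"):      # an N remains after trimming the ends
            skipped.append((name, sequence))
        else:
            chunks[kept % n].append((name, sequence))  # round-robin (assumed)
            kept += 1
    return chunks, skipped
```

Chunk `i` would then be written to `seqlist_<i>.fa`, and the skipped records to `skipped.fa`.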

Compress and store the Fastq files:
```
gzip t??_r?.fq
mv t??_r?.fq.gz /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Fastq/
```
Count the number of reads in each sample:
```
python count_sequences.py
```
generating these counts:

| condition | replicate | number of reads |
| --------- | --------- | --------------- |
|    t00    |     r1    |     46504440    |
|    t00    |     r2    |     40266578    |
|    t00    |     r3    |     38037329    |
|    t01    |     r1    |     52136630    |
|    t01    |     r2    |     39659912    |
|    t01    |     r3    |        36158    |
|    t04    |     r1    |     33806605    |
|    t04    |     r2    |     38603727    |
|    t04    |     r3    |     45081756    |
|    t12    |     r1    |     47612643    |
|    t12    |     r2    |     41905665    |
|    t12    |     r3    |     42628669    |
|    t24    |     r1    |     41203962    |
|    t24    |     r2    |     37153211    |
|    t24    |     r3    |     28795459    |
|    t96    |     r1    |     35172799    |
|    t96    |     r2    |     44639452    |
|    t96    |     r3    |     20876225    |

Note that the third replicate of the 1-hour time point (`t01`, `r3`) was a negative control in which the RNA sample was replaced by water, which explains its low read count.
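
Read counts like those in the table can be obtained by counting Fastq records. A minimal sketch, assuming four lines per record (no wrapped sequence lines); the function names are hypothetical and `count_sequences.py` may be implemented differently:

```python
import gzip


def count_fastq_records(lines):
    """Count Fastq records in an iterable of lines (four lines per record)."""
    total = sum(1 for _ in lines)
    if total % 4:
        raise ValueError(f"{total} lines is not a multiple of 4")
    return total // 4


def count_fastq_file(path):
    """Count the reads in a plain or gzip-compressed Fastq file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return count_fastq_records(handle)
```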

Create directories to store the mapping results:
```
mkdir /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/PSL
mkdir /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/BAM
```

# 3. Filter against chrM, ribosomal RNA, and tRNAs

## 3.1 Filter against chrM

Create the scripts to filter against mitochondrial DNA, and run `script.sh` to schedule the jobs on Grid Engine:
```
python make_filter_scripts.py chrM 100
bash -x script.sh
```
This will run
```
python filter.py chrM <number>
```
which performs a Needleman-Wunsch global alignment of each unique sequenced read in `seqlist_<number>.fa` against the forward and reverse strands of the chrM sequence. The mapping results are saved as `.psl` files in the current directory. To combine the results, use
```
python merge_filtered.py chrM
```
generating the merged file `chrM.psl`.
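
For reference, the textbook Needleman-Wunsch recurrence is sketched below. The scoring parameters used by `filter.py` are not stated in this document; the values here (match +1, mismatch -1, gap -1) are assumptions, and the sketch returns only the score, not the alignment. The helper `reverse_complement` shows how the reverse strand can be derived from the forward sequence.

```python
def needleman_wunsch(query, target, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of query against target."""
    # previous[j] holds the best score of aligning the query prefix processed
    # so far against target[:j]; the first row consists of gaps only.
    previous = [j * gap for j in range(len(target) + 1)]
    for i, q in enumerate(query, start=1):
        current = [i * gap]
        for j, t in enumerate(target, start=1):
            diagonal = previous[j - 1] + (match if q == t else mismatch)
            current.append(max(diagonal, previous[j] + gap, current[j - 1] + gap))
        previous = current
    return previous[-1]


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]
```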
Remove the intermediate files:
```
rm script_chrM_*.sh
rm script_chrM_*.stdout
rm script_chrM_*.stderr
```

Convert the coordinates of the forward and reverse strand of chrM to genomic coordinates:

```
target=chrM
echo -e '@HD\tVN:1.6' | samtools view -H -t /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes | grep -v _alt > $target.sam
pslMap -mapInfo=$target.info $target.psl /osc-fs_home/mdehoon/Data/CASPARs/Filters/$target.psl stdout | psl2sam.pl >> $target.sam
samtools view -hb $target.sam -o $target.bam
python add_targets.py $target
rm $target.sam
rm $target.info
```
Remove the mapped sequences from the Fasta files:
```
python make_removal_scripts.py chrM 100
bash -x script.sh
```
This will run
```
python remove_mapped_sequences.py chrM <number>
```
which rewrites the `seqlist_<number>.fa` files, retaining the unmapped sequences only. Check the output files:
```
cat script_chrM_*.stdout | grep -c Done
cat script_chrM_*.stderr
```
If there were no errors, we can remove the temporary files:
```
rm script.sh
rm script_chrM_*.sh
rm script_chrM_*.stdout
rm script_chrM_*.stderr
```
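
The removal step can be illustrated with a short sketch. It relies on the PSL format, in which the query name (`qName`) is the 10th tab-delimited column; the function names are hypothetical, and `remove_mapped_sequences.py` may work differently.

```python
def mapped_names(psl_lines):
    """Collect the query names (column 10) of all alignments in a PSL file."""
    names = set()
    for line in psl_lines:
        fields = line.rstrip("\n").split("\t")
        # Alignment lines have 21 columns and start with the match count;
        # this skips the optional PSL header lines.
        if len(fields) >= 21 and fields[0].isdigit():
            names.add(fields[9])
    return names


def retain_unmapped(fasta_lines, mapped):
    """Yield the Fasta lines of records whose name is not in `mapped`."""
    keep = True
    for line in fasta_lines:
        if line.startswith(">"):
            keep = line[1:].split()[0] not in mapped
        if keep:
            yield line
```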

Store the mapping results created by this script:
```
gzip chrM.psl
mv chrM.psl.gz /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/PSL
mv chrM.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/BAM
```

Remove the intermediate files:
```
rm chrM.*.psl
```

## 3.2 Filter against ribosomal RNA

Create the scripts to filter against ribosomal RNA, and run `script.sh` to schedule the jobs on Grid Engine:
```
python make_filter_scripts.py rRNA 100
bash -x script.sh
```
This will run
```
python filter.py rRNA <number>
```
which performs a Needleman-Wunsch global alignment of each unique sequenced read in `seqlist_<number>.fa` against the ribosomal RNA sequences. The mapping results are saved as `.psl` files in the current directory. To combine the results, use
```
python merge_filtered.py rRNA
```
generating the merged file `rRNA.psl`. Remove the intermediate files:
```
rm script_rRNA_*.sh
rm script_rRNA_*.stdout
rm script_rRNA_*.stderr
```
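
When merging per-chunk results, the files should be taken in numeric rather than alphabetical order, so that chunk 10 follows chunk 9. A small sketch, assuming the per-chunk files are named `<target>.<number>.psl` (inferred from the `rm <target>.*.psl` cleanup commands in this document); the function names are hypothetical:

```python
import re


def chunk_number(filename):
    """Return the chunk number in '<target>.<number>.psl', or None if absent."""
    match = re.fullmatch(r".+\.(\d+)\.psl", filename)
    return int(match.group(1)) if match else None


def merge_order(filenames):
    """Return the per-chunk PSL files sorted by chunk number."""
    numbered = [(chunk_number(name), name) for name in filenames]
    return [name for number, name in sorted(p for p in numbered if p[0] is not None)]
```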

Convert the transcript coordinates to genomic coordinates:
```
target=rRNA
echo -e '@HD\tVN:1.6' | samtools view -H -t /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes | grep -v _alt > $target.sam
pslMap -mapInfo=$target.info $target.psl /osc-fs_home/mdehoon/Data/CASPARs/Filters/$target.psl stdout | psl2sam.pl >> $target.sam
samtools view -hb $target.sam -o $target.bam
python add_targets.py $target
rm $target.sam
rm $target.info
```
Remove the mapped sequences from the Fasta files:
```
python make_removal_scripts.py rRNA 100
bash -x script.sh
```
This will run
```
python remove_mapped_sequences.py rRNA <number>
```
which rewrites the `seqlist_<number>.fa` files, retaining the unmapped sequences only. Check the output files:
```
cat script_rRNA_*.stdout | grep -c Done
cat script_rRNA_*.stderr
```
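
The same check can be scripted so that the log files are only deleted when every job has finished cleanly. This sketch assumes that each job prints one line containing `Done` on success and writes nothing to its `.stderr` file; the function names are hypothetical.

```python
import glob


def read_logs(pattern):
    """Return {filename: contents} for all files matching a glob pattern."""
    logs = {}
    for path in sorted(glob.glob(pattern)):
        with open(path) as handle:
            logs[path] = handle.read()
    return logs


def summarize(stdout_logs, stderr_logs, expected_jobs):
    """Return (ok, number of 'Done' lines, names of non-empty stderr files)."""
    done = sum(1 for text in stdout_logs.values()
               for line in text.splitlines() if "Done" in line)
    failed = sorted(name for name, text in stderr_logs.items() if text.strip())
    return done == expected_jobs and not failed, done, failed
```

For example, `summarize(read_logs("script_rRNA_*.stdout"), read_logs("script_rRNA_*.stderr"), 100)` would report whether all 100 jobs completed.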
If there were no errors, we can remove the temporary files:
```
rm script_rRNA_*.sh
rm script_rRNA_*.stdout
rm script_rRNA_*.stderr
```
Store the mapping results:
```
gzip rRNA.psl
mv rRNA.psl.gz /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/PSL
mv rRNA.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/BAM
```

Remove the intermediate files:
```
rm rRNA.*.psl
```

## 3.3 Filter against transfer RNA

Create the scripts to filter against transfer RNA, and run `script.sh` to schedule the jobs on Grid Engine:
```
python make_filter_scripts.py tRNA 100
bash -x script.sh
```
This will run
```
python filter.py tRNA <number>
```
which performs a Needleman-Wunsch global alignment of each unique sequenced read in `seqlist_<number>.fa` against the transfer RNA sequences. The mapping results are saved as `.psl` files in the current directory. To combine the results, use
```
python merge_filtered.py tRNA
```
generating the merged file `tRNA.psl`. Remove the intermediate files:
```
rm script_tRNA_*.sh
rm script_tRNA_*.stdout
rm script_tRNA_*.stderr
```

Convert the transcript coordinates to genomic coordinates:
```
target=tRNA
echo -e '@HD\tVN:1.6' | samtools view -H -t /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes | grep -v _alt > $target.sam
pslMap -mapInfo=$target.info $target.psl /osc-fs_home/mdehoon/Data/CASPARs/Filters/$target.psl stdout | psl2sam.pl >> $target.sam
samtools view -hb $target.sam -o $target.bam
python add_targets.py $target
rm $target.sam
rm $target.info
```
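
`pslMap` projects the read-to-transcript alignments through the transcript-to-genome alignments. The underlying coordinate arithmetic can be illustrated for a single interval. This is a simplified sketch, not a reimplementation of `pslMap`: it assumes 0-based half-open coordinates and exons given as genomic `(start, end)` pairs sorted by position, and the function name is hypothetical.

```python
def transcript_to_genome(start, end, exons, strand="+"):
    """Map the transcript interval [start, end) onto genomic intervals."""
    # Transcript coordinates run 5' to 3', so a minus-strand transcript
    # begins at the exon with the highest genomic position.
    ordered = exons if strand == "+" else exons[::-1]
    intervals = []
    offset = 0  # transcript coordinate at which the current exon begins
    for exon_start, exon_end in ordered:
        length = exon_end - exon_start
        lo = max(start, offset) - offset           # overlap within this exon
        hi = min(end, offset + length) - offset
        if lo < hi:
            if strand == "+":
                intervals.append((exon_start + lo, exon_start + hi))
            else:
                intervals.append((exon_end - hi, exon_end - lo))
        offset += length
    return sorted(intervals)
```

An interval that spans an exon junction is split into one genomic interval per exon.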
Remove the mapped sequences from the Fasta files:
```
python make_removal_scripts.py tRNA 100
bash -x script.sh
```
This will run
```
python remove_mapped_sequences.py tRNA <number>
```
which rewrites the `seqlist_<number>.fa` files, retaining the unmapped sequences only. Check the output files:
```
cat script_tRNA_*.stdout | grep -c Done
cat script_tRNA_*.stderr
```
If there were no errors, we can remove the temporary files:
```
rm script_tRNA_*.sh
rm script_tRNA_*.stdout
rm script_tRNA_*.stderr
```
Store the mapping results:
```
gzip tRNA.psl
mv tRNA.psl.gz /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/PSL
mv tRNA.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/BAM
```

Remove the intermediate files:
```
rm tRNA.*.psl
```
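
Every filter round in this document uses the same sequence of commands; only the target name changes. The sketch below assembles that sequence for a given target, using the commands and paths listed above. The function name `filter_commands` is hypothetical. The commands are only returned as text, not executed, because each `bash -x script.sh` schedules jobs on Grid Engine that must finish before the next command is run.

```python
def filter_commands(target, chunks=100):
    """Return the shell commands of one filter round, in document order."""
    root = "/osc-fs_home/mdehoon/Data/CASPARs"
    sizes = "/osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes"
    logs = [f"rm script_{target}_*.{ext}" for ext in ("sh", "stdout", "stderr")]
    return [
        # Align the unique sequences against the target and merge the results
        f"python make_filter_scripts.py {target} {chunks}",
        "bash -x script.sh",
        f"python merge_filtered.py {target}",
        *logs,
        # Convert to genomic coordinates and store as BAM
        f"echo -e '@HD\\tVN:1.6' | samtools view -H -t {sizes} | grep -v _alt > {target}.sam",
        f"pslMap -mapInfo={target}.info {target}.psl {root}/Filters/{target}.psl stdout"
        f" | psl2sam.pl >> {target}.sam",
        f"samtools view -hb {target}.sam -o {target}.bam",
        f"python add_targets.py {target}",
        f"rm {target}.sam",
        f"rm {target}.info",
        # Remove the mapped sequences from the Fasta files and check the logs
        f"python make_removal_scripts.py {target} {chunks}",
        "bash -x script.sh",
        f"cat script_{target}_*.stdout | grep -c Done",
        f"cat script_{target}_*.stderr",
        *logs,
        # Store the results and clean up
        f"gzip {target}.psl",
        f"mv {target}.psl.gz {root}/HiSeq/PSL",
        f"mv {target}.bam {root}/HiSeq/BAM",
        f"rm {target}.*.psl",
    ]
```

Calling `print("\n".join(filter_commands("snRNA")))` lists the commands for the snRNA round.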

# 4. Align against known transcripts

## 4.1 Small nuclear RNAs (spliceosomal RNAs)

Create the scripts to filter against small nuclear RNAs, and run `script.sh` to schedule the jobs on Grid Engine:
```
python make_filter_scripts.py snRNA 100
bash -x script.sh
```
This will run
```
python filter.py snRNA <number>
```
which performs a Needleman-Wunsch global alignment of each unique sequenced read in `seqlist_<number>.fa` against the small nuclear RNA sequences. The mapping results are saved as `.psl` files in the current directory. To combine the results, use
```
python merge_filtered.py snRNA
```
generating the merged file `snRNA.psl`. Remove the intermediate files:
```
rm script_snRNA_*.sh
rm script_snRNA_*.stdout
rm script_snRNA_*.stderr
```

Convert the transcript coordinates to genomic coordinates:
```
target=snRNA
echo -e '@HD\tVN:1.6' | samtools view -H -t /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes | grep -v _alt > $target.sam
pslMap -mapInfo=$target.info $target.psl /osc-fs_home/mdehoon/Data/CASPARs/Filters/$target.psl stdout | psl2sam.pl >> $target.sam
samtools view -hb $target.sam -o $target.bam
python add_targets.py $target
rm $target.sam
rm $target.info
```
Remove the mapped sequences from the Fasta files:
```
python make_removal_scripts.py snRNA 100
bash -x script.sh
```
This will run
```
python remove_mapped_sequences.py snRNA <number>
```
which rewrites the `seqlist_<number>.fa` files, retaining the unmapped sequences only. Check the output files:
```
cat script_snRNA_*.stdout | grep -c Done
cat script_snRNA_*.stderr
```
If there were no errors, we can remove the temporary files:
```
rm script_snRNA_*.sh
rm script_snRNA_*.stdout
rm script_snRNA_*.stderr
```
Store the mapping results:
```
gzip snRNA.psl
mv snRNA.psl.gz /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/PSL
mv snRNA.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/BAM
```

Remove the intermediate files:
```
rm snRNA.*.psl
```

## 4.2 Small cytoplasmic RNAs (7SL RNAs and Brain cytoplasmic RNA 1)

Create the scripts to filter against small cytoplasmic RNAs, and run `script.sh` to schedule the jobs on Grid Engine:
```
python make_filter_scripts.py scRNA 100
bash -x script.sh
```
This will run
```
python filter.py scRNA <number>
```
which performs a Needleman-Wunsch global alignment of each unique sequenced read in `seqlist_<number>.fa` against the small cytoplasmic RNA sequences. The mapping results are saved as `.psl` files in the current directory. To combine the results, use
```
python merge_filtered.py scRNA
```
generating the merged file `scRNA.psl`. Remove the intermediate files:
```
rm script_scRNA_*.sh
rm script_scRNA_*.stdout
rm script_scRNA_*.stderr
```

Convert the transcript coordinates to genomic coordinates:
```
target=scRNA
echo -e '@HD\tVN:1.6' | samtools view -H -t /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes | grep -v _alt > $target.sam
pslMap -mapInfo=$target.info $target.psl /osc-fs_home/mdehoon/Data/CASPARs/Filters/$target.psl stdout | psl2sam.pl >> $target.sam
samtools view -hb $target.sam -o $target.bam
python add_targets.py $target
rm $target.sam
rm $target.info
```
Remove the mapped sequences from the Fasta files:
```
python make_removal_scripts.py scRNA 100
bash -x script.sh
```
This will run
```
python remove_mapped_sequences.py scRNA <number>
```
which rewrites the `seqlist_<number>.fa` files, retaining the unmapped sequences only. Check the output files:
```
cat script_scRNA_*.stdout | grep -c Done
cat script_scRNA_*.stderr
```
If there were no errors, we can remove the temporary files:
```
rm script_scRNA_*.sh
rm script_scRNA_*.stdout
rm script_scRNA_*.stderr
```
Store the mapping results:
```
gzip scRNA.psl
mv scRNA.psl.gz /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/PSL
mv scRNA.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/BAM
```

Remove the intermediate files:
```
rm scRNA.*.psl
```

## 4.3 Small nucleolar RNAs

Create the scripts to filter against small nucleolar RNAs, and run `script.sh` to schedule the jobs on Grid Engine:
```
python make_filter_scripts.py snoRNA 100
bash -x script.sh
```
This will run
```
python filter.py snoRNA <number>
```
which performs a Needleman-Wunsch global alignment of each unique sequenced read in `seqlist_<number>.fa` against the small nucleolar RNA sequences. The mapping results are saved as `.psl` files in the current directory. To combine the results, use
```
python merge_filtered.py snoRNA
```
generating the merged file `snoRNA.psl`. Remove the intermediate files:
```
rm script_snoRNA_*.sh
rm script_snoRNA_*.stdout
rm script_snoRNA_*.stderr
```

Convert the transcript coordinates to genomic coordinates:
```
target=snoRNA
echo -e '@HD\tVN:1.6' | samtools view -H -t /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes | grep -v _alt > $target.sam
pslMap -mapInfo=$target.info $target.psl /osc-fs_home/mdehoon/Data/CASPARs/Filters/$target.psl stdout | psl2sam.pl >> $target.sam
samtools view -hb $target.sam -o $target.bam
python add_targets.py $target
rm $target.sam
rm $target.info
```
Remove the mapped sequences from the Fasta files:
```
python make_removal_scripts.py snoRNA 100
bash -x script.sh
```
This will run
```
python remove_mapped_sequences.py snoRNA <number>
```
which rewrites the `seqlist_<number>.fa` files, retaining the unmapped sequences only. Check the output files:
```
cat script_snoRNA_*.stdout | grep -c Done
cat script_snoRNA_*.stderr
```
If there were no errors, we can remove the temporary files:
```
rm script_snoRNA_*.sh
rm script_snoRNA_*.stdout
rm script_snoRNA_*.stderr
```
Store the mapping results:
```
gzip snoRNA.psl
mv snoRNA.psl.gz /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/PSL
mv snoRNA.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/BAM
```

Remove the intermediate files:
```
rm snoRNA.*.psl
```

## 4.4 Ro-associated RNAs Y1/Y3/Y4/Y5

Create the scripts to filter against Ro-associated RNA sequences, and run `script.sh` to schedule the jobs on Grid Engine:
```
python make_filter_scripts.py yRNA 100
bash -x script.sh
```
This will run
```
python filter.py yRNA <number>
```
which performs a Needleman-Wunsch global alignment of each unique sequenced read in `seqlist_<number>.fa` against the Ro-associated RNA sequences. The mapping results are saved as `.psl` files in the current directory. To combine the results, use
```
python merge_filtered.py yRNA
```
generating the merged file `yRNA.psl`. Remove the intermediate files:
```
rm script_yRNA_*.sh
rm script_yRNA_*.stdout
rm script_yRNA_*.stderr
```

Convert the transcript coordinates to genomic coordinates:
```
target=yRNA
echo -e '@HD\tVN:1.6' | samtools view -H -t /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes | grep -v _alt > $target.sam
pslMap -mapInfo=$target.info $target.psl /osc-fs_home/mdehoon/Data/CASPARs/Filters/$target.psl stdout | psl2sam.pl >> $target.sam
samtools view -hb $target.sam -o $target.bam
python add_targets.py $target
rm $target.sam
rm $target.info
```
Remove the mapped sequences from the Fasta files:
```
python make_removal_scripts.py yRNA 100
bash -x script.sh
```
This will run
```
python remove_mapped_sequences.py yRNA <number>
```
which rewrites the `seqlist_<number>.fa` files, retaining the unmapped sequences only. Check the output files:
```
cat script_yRNA_*.stdout | grep -c Done
cat script_yRNA_*.stderr
```
If there were no errors, we can remove the temporary files:
```
rm script_yRNA_*.sh
rm script_yRNA_*.stdout
rm script_yRNA_*.stderr
```
Store the mapping results:
```
gzip yRNA.psl
mv yRNA.psl.gz /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/PSL
mv yRNA.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/BAM
```

Remove the intermediate files:
```
rm yRNA.*.psl
```

## 4.5 Histone genes

Create the scripts to filter against histone mRNA sequences, and run `script.sh` to schedule the jobs on Grid Engine:
```
python make_filter_scripts.py histone 100
bash -x script.sh
```
This will run
```
python filter.py histone <number>
```
which performs a Needleman-Wunsch global alignment of each unique sequenced read in `seqlist_<number>.fa` against the histone mRNA sequences. The mapping results are saved as `.psl` files in the current directory. To combine the results, use
```
python merge_filtered.py histone
```
generating the merged file `histone.psl`. Remove the intermediate files:
```
rm script_histone_*.sh
rm script_histone_*.stdout
rm script_histone_*.stderr
```

Convert the transcript coordinates to genomic coordinates:
```
target=histone
echo -e '@HD\tVN:1.6' | samtools view -H -t /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes | grep -v _alt > $target.sam
pslMap -mapInfo=$target.info $target.psl /osc-fs_home/mdehoon/Data/CASPARs/Filters/$target.psl stdout | psl2sam.pl >> $target.sam
samtools view -hb $target.sam -o $target.bam
python add_targets.py $target
rm $target.sam
rm $target.info
```
Remove the mapped sequences from the Fasta files:
```
python make_removal_scripts.py histone 100
bash -x script.sh
```
This will run
```
python remove_mapped_sequences.py histone <number>
```
which rewrites the `seqlist_<number>.fa` files, retaining the unmapped sequences only. Check the output files:
```
cat script_histone_*.stdout | grep -c Done
cat script_histone_*.stderr
```
If there were no errors, we can remove the temporary files:
```
rm script_histone_*.sh
rm script_histone_*.stdout
rm script_histone_*.stderr
```
Store the mapping results:
```
gzip histone.psl
mv histone.psl.gz /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/PSL
mv histone.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/BAM
```

Remove the intermediate files:
```
rm histone.*.psl
```

## 4.6 RNA component of mitochondrial RNA processing endoribonuclease (RMRP)

Create the scripts to filter against the RMRP transcript sequence, and run `script.sh` to schedule the jobs on Grid Engine:
```
python make_filter_scripts.py RMRP 100
bash -x script.sh
```
This will run
```
python filter.py RMRP <number>
```
which performs a Needleman-Wunsch global alignment of each unique sequenced read in `seqlist_<number>.fa` against the RMRP transcript sequence. The mapping results are saved as `.psl` files in the current directory. To combine the results, use
```
python merge_filtered.py RMRP
```
generating the merged file `RMRP.psl`. Remove the intermediate files:
```
rm script_RMRP_*.sh
rm script_RMRP_*.stdout
rm script_RMRP_*.stderr
```

Convert the transcript coordinates to genomic coordinates:
```
target=RMRP
echo -e '@HD\tVN:1.6' | samtools view -H -t /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes | grep -v _alt > $target.sam
pslMap -mapInfo=$target.info $target.psl /osc-fs_home/mdehoon/Data/CASPARs/Filters/$target.psl stdout | psl2sam.pl >> $target.sam
samtools view -hb $target.sam -o $target.bam
python add_targets.py $target
rm $target.sam
rm $target.info
```
Remove the mapped sequences from the Fasta files:
```
python make_removal_scripts.py RMRP 100
bash -x script.sh
```
This will run
```
python remove_mapped_sequences.py RMRP <number>
```
which rewrites the `seqlist_<number>.fa` files, retaining the unmapped sequences only. Check the output files:
```
cat script_RMRP_*.stdout | grep -c Done
cat script_RMRP_*.stderr
```
If there were no errors, we can remove the temporary files:
```
rm script_RMRP_*.sh
rm script_RMRP_*.stdout
rm script_RMRP_*.stderr
```
Store the mapping results:
```
gzip RMRP.psl
mv RMRP.psl.gz /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/PSL
mv RMRP.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/BAM
```

Remove the intermediate files:
```
rm RMRP.*.psl
```

## 4.7 Small Cajal body-specific RNAs

Create the scripts to filter against small Cajal body-specific RNAs, and run `script.sh` to schedule the jobs on Grid Engine:
```
python make_filter_scripts.py scaRNA 100
bash -x script.sh
```
This will run
```
python filter.py scaRNA <number>
```
which performs a Needleman-Wunsch global alignment of each unique sequenced read in `seqlist_<number>.fa` against the small Cajal body-specific RNA sequences. The mapping results are saved as `.psl` files in the current directory. To combine the results, use
```
python merge_filtered.py scaRNA
```
generating the merged file `scaRNA.psl`. Remove the intermediate files:
```
rm script_scaRNA_*.sh
rm script_scaRNA_*.stdout
rm script_scaRNA_*.stderr
```

Convert the transcript coordinates to genomic coordinates:
```
target=scaRNA
echo -e '@HD\tVN:1.6' | samtools view -H -t /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes | grep -v _alt > $target.sam
pslMap -mapInfo=$target.info $target.psl /osc-fs_home/mdehoon/Data/CASPARs/Filters/$target.psl stdout | psl2sam.pl >> $target.sam
samtools view -hb $target.sam -o $target.bam
python add_targets.py $target
rm $target.sam
rm $target.info
```
Remove the mapped sequences from the Fasta files:
```
python make_removal_scripts.py scaRNA 100
bash -x script.sh
```
This will run
```
python remove_mapped_sequences.py scaRNA <number>
```
which rewrites the `seqlist_<number>.fa` files, retaining the unmapped sequences only. Check the output files:
```
cat script_scaRNA_*.stdout | grep -c Done
cat script_scaRNA_*.stderr
```
If there were no errors, we can remove the temporary files:
```
rm script_scaRNA_*.sh
rm script_scaRNA_*.stdout
rm script_scaRNA_*.stderr
```
Store the mapping results:
```
gzip scaRNA.psl
mv scaRNA.psl.gz /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/PSL
mv scaRNA.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/BAM
```

Remove the intermediate files:
```
rm scaRNA.*.psl
```

## 4.8 RNA component of the RNase P ribonucleoprotein

Create the scripts to filter against the RPPH transcript sequence, and run `script.sh` to schedule the jobs on Grid Engine:
```
python make_filter_scripts.py RPPH 100
bash -x script.sh
```
This will run
```
python filter.py RPPH <number>
```
which performs a Needleman-Wunsch global alignment of each unique sequenced read in `seqlist_<number>.fa` against the RPPH transcript sequence. The mapping results are saved as `.psl` files in the current directory. To combine the results, use
```
python merge_filtered.py RPPH
```
generating the merged file `RPPH.psl`. Remove the intermediate files:
```
rm script_RPPH_*.sh
rm script_RPPH_*.stdout
rm script_RPPH_*.stderr
```

Convert the transcript coordinates to genomic coordinates:
```
target=RPPH
echo -e '@HD\tVN:1.6' | samtools view -H -t /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes | grep -v _alt > $target.sam
pslMap -mapInfo=$target.info $target.psl /osc-fs_home/mdehoon/Data/CASPARs/Filters/$target.psl stdout | psl2sam.pl >> $target.sam
samtools view -hb $target.sam -o $target.bam
python add_targets.py $target
rm $target.sam
rm $target.info
```
Remove the mapped sequences from the Fasta files:
```
python make_removal_scripts.py RPPH 100
bash -x script.sh
```
This will run
```
python remove_mapped_sequences.py RPPH <number>
```
which rewrites the `seqlist_<number>.fa` files, retaining the unmapped sequences only. Check the output files:
```
cat script_RPPH_*.stdout | grep -c Done
cat script_RPPH_*.stderr
```
If there were no errors, we can remove the temporary files:
```
rm script_RPPH_*.sh
rm script_RPPH_*.stdout
rm script_RPPH_*.stderr
```
Store the mapping results:
```
gzip RPPH.psl
mv RPPH.psl.gz /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/PSL
mv RPPH.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/BAM
```

Remove the intermediate files:
```
rm RPPH.*.psl
```

## 4.9 Small ILF3/NF90-associated RNAs

Create the scripts to filter against small ILF3/NF90-associated RNA sequences, and run `script.sh` to schedule the jobs on Grid Engine:
```
python make_filter_scripts.py snar 100
bash -x script.sh
```
This will run
```
python filter.py snar <number>
```
which performs a Needleman-Wunsch global alignment of each unique sequenced read in `seqlist_<number>.fa` against the small ILF3/NF90-associated RNA sequences. The mapping results are saved as `.psl` files in the current directory. To combine the results, use
```
python merge_filtered.py snar
```
generating the merged file `snar.psl`. Remove the intermediate files:
```
rm script_snar_*.sh
rm script_snar_*.stdout
rm script_snar_*.stderr
```

Convert the transcript coordinates to genomic coordinates:
```
target=snar
echo -e '@HD\tVN:1.6' | samtools view -H -t /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes | grep -v _alt > $target.sam
pslMap -mapInfo=$target.info $target.psl /osc-fs_home/mdehoon/Data/CASPARs/Filters/$target.psl stdout | psl2sam.pl >> $target.sam
samtools view -hb $target.sam -o $target.bam
python add_targets.py $target
rm $target.sam
rm $target.info
```
Remove the mapped sequences from the Fasta files:
```
python make_removal_scripts.py snar 100
bash -x script.sh
```
This will run
```
python remove_mapped_sequences.py snar <number>
```
which rewrites the `seqlist_<number>.fa` files, retaining the unmapped sequences only. Check the output files:
```
cat script_snar_*.stdout | grep -c Done
cat script_snar_*.stderr
```
If there were no errors, we can remove the temporary files:
```
rm script_snar_*.sh
rm script_snar_*.stdout
rm script_snar_*.stderr
```
Store the mapping results:
```
gzip snar.psl
mv snar.psl.gz /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/PSL
mv snar.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/BAM
```

Remove the intermediate files:
```
rm snar.*.psl
```

## 4.10 Telomerase RNA component (TERC)

Create the scripts to filter against the TERC transcript sequence, and run `script.sh` to schedule the jobs on Grid Engine:
```
python make_filter_scripts.py TERC 100
bash -x script.sh
```
This will run
```
python filter.py TERC <number>
```
which performs a Needleman-Wunsch global alignment of each unique sequenced read in `seqlist_<number>.fa` against the TERC transcript sequence. The mapping results are saved as `.psl` files in the current directory. To combine the results, use
```
python merge_filtered.py TERC
```
generating the merged file `TERC.psl`. Remove the intermediate files:
```
rm script_TERC_*.sh
rm script_TERC_*.stdout
rm script_TERC_*.stderr
```

Convert the transcript coordinates to genomic coordinates:
```
target=TERC
echo -e '@HD\tVN:1.6' | samtools view -H -t /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes | grep -v _alt > $target.sam
pslMap -mapInfo=$target.info $target.psl /osc-fs_home/mdehoon/Data/CASPARs/Filters/$target.psl stdout | psl2sam.pl >> $target.sam
samtools view -hb $target.sam -o $target.bam
python add_targets.py $target
rm $target.sam
rm $target.info
```
Remove the mapped sequences from the Fasta files:
```
python make_removal_scripts.py TERC 100
bash -x script.sh
```
This will run
```
python remove_mapped_sequences.py TERC <number>
```
which rewrites the `seqlist_<number>.fa` files, retaining the unmapped sequences only. Check the output files:
```
cat script_TERC_*.stdout | grep -c Done
cat script_TERC_*.stderr
```
If there were no errors, we can remove the temporary files:
```
rm script_TERC_*.sh
rm script_TERC_*.stdout
rm script_TERC_*.stderr
```
Store the mapping results:
```
gzip TERC.psl
mv TERC.psl.gz /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/PSL
mv TERC.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/BAM
```

Remove the intermediate files:
```
rm TERC.*.psl
```

## 4.11 Vault RNAs

Create the scripts to filter against vault RNA sequences, and run `script.sh` to schedule the jobs on Grid Engine:
```
python make_filter_scripts.py vRNA 100
bash -x script.sh
```
This will run
```
python filter.py vRNA <number>
```
which performs a Needleman-Wunsch global alignment of each unique sequenced read in `seqlist_<number>.fa` against the vault RNA sequences. The mapping results are saved as `.psl` files in the current directory. To combine the results, use
```
python merge_filtered.py vRNA
```
generating the merged file `vRNA.psl`. Remove the intermediate files:
```
rm script_vRNA_*.sh
rm script_vRNA_*.stdout
rm script_vRNA_*.stderr
```

Convert the transcript coordinates to genomic coordinates:
```
target=vRNA
echo -e '@HD\tVN:1.6' | samtools view -H -t /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes | grep -v _alt > $target.sam
pslMap -mapInfo=$target.info $target.psl /osc-fs_home/mdehoon/Data/CASPARs/Filters/$target.psl stdout | psl2sam.pl >> $target.sam
samtools view -hb $target.sam -o $target.bam
python add_targets.py $target
rm $target.sam
rm $target.info
```
Remove the mapped sequences from the Fasta files:
```
python make_removal_scripts.py vRNA 100
bash -x script.sh
```
This will run
```
python remove_mapped_sequences.py vRNA <number>
```
which rewrites the `seqlist_<number>.fa` files, retaining the unmapped sequences only. Check the output files:
```
cat script_vRNA_*.stdout | grep -c Done
cat script_vRNA_*.stderr
```
If there were no errors, we can remove the temporary files:
```
rm script_vRNA_*.sh
rm script_vRNA_*.stdout
rm script_vRNA_*.stderr
```
Store the mapping results:
```
gzip vRNA.psl
mv vRNA.psl.gz /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/PSL
mv vRNA.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/BAM
```

Remove the intermediate files:
```
rm vRNA.*.psl
```

## 4.12 Metastasis-associated lung adenocarcinoma transcript 1 (MALAT1)

Create the scripts to filter against the MALAT1 transcript sequences, and run `script.sh` to schedule the jobs on Grid Engine:
```
python make_filter_scripts.py MALAT1 100
bash -x script.sh
```
This will run
```
python filter.py MALAT1 <number>
```
which performs a Needleman-Wunsch global alignment of each unique sequenced read in `seqlist_<number>.fa` against the MALAT1 transcript sequences. The mapping results are saved as `.psl` files in the current directory. To combine the results, use
```
python merge_filtered.py MALAT1
```
generating the merged file `MALAT1.psl`. Remove the intermediate files:
```
rm script_MALAT1_*.sh
rm script_MALAT1_*.stdout
rm script_MALAT1_*.stderr
```

Convert the transcript coordinates to genomic coordinates:
```
target=MALAT1
echo -e '@HD\tVN:1.6' | samtools view -H -t /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes | grep -v _alt > $target.sam
pslMap -mapInfo=$target.info $target.psl /osc-fs_home/mdehoon/Data/CASPARs/Filters/$target.psl stdout | psl2sam.pl >> $target.sam
samtools view -hb $target.sam -o $target.bam
python add_targets.py $target
rm $target.sam
rm $target.info
```
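In the `pslMap` call above, the first `.psl` file holds the read-to-transcript alignments and the second holds the transcript-to-genome alignments, so each read alignment is projected through the exon structure of its transcript. The sketch below illustrates that projection for a single position on a plus-strand transcript. The function name `transcript_to_genome` and the exon coordinates are made up for illustration; `pslMap` itself works on whole alignments and also handles strands and gaps.

```python
def transcript_to_genome(position, exons):
    """Map a 0-based transcript position to a 0-based genomic position.

    exons is a list of (genome_start, genome_end) half-open intervals in
    transcript order, for a transcript on the plus strand.
    """
    if position < 0:
        raise ValueError("position must not be negative")
    offset = 0  # transcript coordinate at which the current exon starts
    for genome_start, genome_end in exons:
        length = genome_end - genome_start
        if position < offset + length:
            return genome_start + (position - offset)
        offset += length
    raise ValueError("position lies beyond the end of the transcript")


# Hypothetical transcript with two exons of 100 and 50 nucleotides.
exons = [(1000, 1100), (2000, 2050)]
print(transcript_to_genome(0, exons))    # 1000: first base of exon 1
print(transcript_to_genome(99, exons))   # 1099: last base of exon 1
print(transcript_to_genome(100, exons))  # 2000: first base of exon 2
```

A read that spans the exon junction in transcript coordinates therefore becomes a split alignment in genomic coordinates.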
Remove the mapped sequences from the Fasta files:
```
python make_removal_scripts.py MALAT1 100
bash -x script.sh
```
This will run
```
python remove_mapped_sequences.py MALAT1 <number>
```
which rewrites the `seqlist_<number>.fa` files, retaining the unmapped sequences only. Check the output files:
```
cat script_MALAT1_*.stdout | grep -c Done
cat script_MALAT1_*.stderr
```
If there were no errors, we can remove the temporary files:
```
rm script_MALAT1_*.sh
rm script_MALAT1_*.stdout
rm script_MALAT1_*.stderr
```
Store the mapping results:
```
gzip MALAT1.psl
mv MALAT1.psl.gz /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/PSL
mv MALAT1.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/BAM
```

Remove the intermediate files:
```
rm MALAT1.*.psl
```

## 4.13 Small nucleolar RNA host genes

Create the scripts to filter against small nucleolar RNA host gene transcript sequences, and run `script.sh` to schedule the jobs on Grid Engine:
```
python make_filter_scripts.py snhg 100
bash -x script.sh
```
This will run
```
python filter.py snhg <number>
```
which performs a Needleman-Wunsch global alignment of each unique sequenced read in `seqlist_<number>.fa` against the small nucleolar RNA host gene transcript sequences. The mapping results are saved as `.psl` files in the current directory. To combine the results, use
```
python merge_filtered.py snhg
```
generating the merged file `snhg.psl`. Remove the intermediate files:
```
rm script_snhg_*.sh
rm script_snhg_*.stdout
rm script_snhg_*.stderr
```

Convert the transcript coordinates to genomic coordinates:
```
target=snhg
echo -e '@HD\tVN:1.6' | samtools view -H -t /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes | grep -v _alt > $target.sam
pslMap -mapInfo=$target.info $target.psl /osc-fs_home/mdehoon/Data/CASPARs/Filters/$target.psl stdout | psl2sam.pl >> $target.sam
samtools view -hb $target.sam -o $target.bam
python add_targets.py $target
rm $target.sam
rm $target.info
```
Remove the mapped sequences from the Fasta files:
```
python make_removal_scripts.py snhg 100
bash -x script.sh
```
This will run
```
python remove_mapped_sequences.py snhg <number>
```
which rewrites the `seqlist_<number>.fa` files, retaining the unmapped sequences only. Check the output files:
```
cat script_snhg_*.stdout | grep -c Done
cat script_snhg_*.stderr
```
If there were no errors, we can remove the temporary files:
```
rm script_snhg_*.sh
rm script_snhg_*.stdout
rm script_snhg_*.stderr
```
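As a rough illustration of the removal step described above, the sketch below drops the Fasta records whose names appear in a set of mapped sequence names. How `remove_mapped_sequences.py` actually determines which sequences were mapped is not shown in this document; the helper `remove_mapped` and the example records are assumptions.

```python
def remove_mapped(fasta_lines, mapped_names):
    """Yield the lines of the Fasta records whose name is not in mapped_names."""
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            # The record name is the first word after '>'.
            words = line[1:].split()
            name = words[0] if words else ""
            keep = name not in mapped_names
        if keep:
            yield line


fasta = [">seq1\n", "ACGT\n", ">seq2\n", "GGCC\n", "TTAA\n", ">seq3\n", "CATG\n"]
mapped = {"seq2"}  # hypothetical query names found in the merged .psl file
print("".join(remove_mapped(fasta, mapped)), end="")
```

Here the two-line record `seq2` is removed, and `seq1` and `seq3` are written out unchanged.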
Store the mapping results:
```
gzip snhg.psl
mv snhg.psl.gz /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/PSL
mv snhg.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/BAM
```

Remove the intermediate files:
```
rm snhg.*.psl
```

# 5. Align against known transcripts using BWA

## 5.1 Messenger RNAs

Create a .2bit file with the mRNA sequences:
```
faToTwoBit /osc-fs_home/mdehoon/Data/CASPARs/Filters/mRNA.fa mRNA.2bit
```
Use BWA to map all HiSeq sequences to the messenger RNAs:
```
python make_alignment_scripts.py mRNA
bash -x script.sh
```
Each generated script runs the following pipeline on one `seqlist_<number>.fa` file, mapping its sequences to the messenger RNAs:
```
bwa mem -O 0 -E 1 -A 1 -B 1 -T 10 -k 10 -c 100000000 -a -Y /osc-fs_home/mdehoon/Data/CASPARs/Filters/mRNA.fa seqlist_<number>.fa | samtools view -F 20 -u | bamToPsl - stdout | pslCheck stdin -pass=stdout -quiet 2> mRNA.<number>.out | sort -k 14 | pslRecalcMatch stdin mRNA.2bit seqlist_<number>.fa stdout | sort -k 10 > mRNA.<number>.psl
```
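The scoring options in this command are `-A 1` (match score), `-B 1` (mismatch penalty), `-O 0` (gap open penalty) and `-E 1` (gap extension penalty), with `-T 10` as the minimum score for an alignment to be reported. BWA charges `O + k*E` for a gap of length `k`, so under these settings every mismatch and every gapped base costs one point. The sketch below recomputes such a score from alignment counts; `alignment_score` is a hypothetical helper, not part of BWA, and it ignores clipping.

```python
def alignment_score(matches, mismatches, gap_lengths,
                    match=1, mismatch=1, gap_open=0, gap_extend=1):
    """Score an alignment with an affine gap cost of gap_open + k * gap_extend."""
    score = matches * match - mismatches * mismatch
    for k in gap_lengths:
        score -= gap_open + k * gap_extend
    return score


MINIMUM_SCORE = 10  # corresponds to -T 10

# An alignment with 19 matches, one mismatch and one 2-nucleotide gap.
score = alignment_score(matches=19, mismatches=1, gap_lengths=[2])
print(score, score >= MINIMUM_SCORE)  # 16 True
```

With these values a perfectly matching stretch of 10 nucleotides is just enough to pass the threshold.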
To combine the results, use
```
python merge_filtered.py mRNA
```
generating the merged file `mRNA.psl`.
Remove the intermediate files:
```
rm mRNA.2bit
rm mRNA.*.out
rm script_mRNA_*.sh
rm script_mRNA_*.stdout
rm script_mRNA_*.stderr
```
Convert the transcript coordinates to genomic coordinates:
```
target=mRNA
echo -e '@HD\tVN:1.6' | samtools view -H -t /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes | grep -v _alt > $target.sam
pslMap -mapInfo=$target.info $target.psl /osc-fs_home/mdehoon/Data/CASPARs/Filters/$target.psl stdout | psl2sam.pl >> $target.sam
samtools view -hb $target.sam -o $target.bam
python add_targets.py $target
rm $target.sam
rm $target.info
```

Remove the mapped sequences from the Fasta files:
```
python make_removal_scripts.py mRNA 100
bash -x script.sh
```
This will run
```
python remove_mapped_sequences.py mRNA <number>
```
which rewrites the `seqlist_<number>.fa` files, retaining the unmapped sequences only. Check the output files:
```
cat script_mRNA_*.stdout | grep -c Done
cat script_mRNA_*.stderr
```
If there were no errors, we can remove the temporary files:
```
rm script_mRNA_*.sh
rm script_mRNA_*.stdout
rm script_mRNA_*.stderr
```
Store the mapping results:
```
gzip mRNA.psl
mv mRNA.psl.gz /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/PSL
mv mRNA.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/BAM
```
Remove the intermediate files:
```
rm mRNA.*.psl
```

## 5.2 Long non-coding RNAs

Create a .2bit file with the lncRNA sequences:
```
faToTwoBit /osc-fs_home/mdehoon/Data/CASPARs/Filters/lncRNA.fa lncRNA.2bit
```
Use BWA to map all HiSeq sequences to the long non-coding RNAs:
```
python make_alignment_scripts.py lncRNA
bash -x script.sh
```
Each generated script runs the following pipeline on one `seqlist_<number>.fa` file, mapping its sequences to the long non-coding RNAs:
```
bwa mem -O 0 -E 1 -A 1 -B 1 -T 10 -k 10 -c 100000000 -a -Y /osc-fs_home/mdehoon/Data/CASPARs/Filters/lncRNA.fa seqlist_<number>.fa | samtools view -F 20 -u | bamToPsl - stdout | pslCheck stdin -pass=stdout -quiet 2> lncRNA.<number>.out | sort -k 14 | pslRecalcMatch stdin lncRNA.2bit seqlist_<number>.fa stdout | sort -k 10 > lncRNA.<number>.psl
```
To combine the results, use
```
python merge_filtered.py lncRNA
```
generating the merged file `lncRNA.psl`.
Remove the intermediate files:
```
rm lncRNA.2bit
rm lncRNA.*.out
rm script_lncRNA_*.sh
rm script_lncRNA_*.stdout
rm script_lncRNA_*.stderr
```
Convert the transcript coordinates to genomic coordinates:
```
target=lncRNA
echo -e '@HD\tVN:1.6' | samtools view -H -t /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes | grep -v _alt > $target.sam
pslMap -mapInfo=$target.info $target.psl /osc-fs_home/mdehoon/Data/CASPARs/Filters/$target.psl stdout | psl2sam.pl >> $target.sam
samtools view -hb $target.sam -o $target.bam
python add_targets.py $target
rm $target.sam
rm $target.info
```

Remove the mapped sequences from the Fasta files:
```
python make_removal_scripts.py lncRNA 100
bash -x script.sh
```
This will run
```
python remove_mapped_sequences.py lncRNA <number>
```
which rewrites the `seqlist_<number>.fa` files, retaining the unmapped sequences only. Check the output files:
```
cat script_lncRNA_*.stdout | grep -c Done
cat script_lncRNA_*.stderr
```
If there were no errors, we can remove the temporary files:
```
rm script_lncRNA_*.sh
rm script_lncRNA_*.stdout
rm script_lncRNA_*.stderr
```
Store the mapping results:
```
gzip lncRNA.psl
mv lncRNA.psl.gz /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/PSL
mv lncRNA.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/BAM
```
Remove the intermediate files:
```
rm lncRNA.*.psl
```

## 5.3 Gencode transcripts

Create a .2bit file with the Gencode transcript sequences:
```
faToTwoBit /osc-fs_home/mdehoon/Data/CASPARs/Filters/gencode.fa gencode.2bit
```
Use BWA to map all HiSeq sequences to the Gencode transcripts:
```
python make_alignment_scripts.py gencode
bash -x script.sh
```
Each generated script runs the following pipeline on one `seqlist_<number>.fa` file, mapping its sequences to the Gencode transcripts:
```
bwa mem -O 0 -E 1 -A 1 -B 1 -T 10 -k 10 -c 100000000 -a -Y /osc-fs_home/mdehoon/Data/CASPARs/Filters/gencode.fa seqlist_<number>.fa | samtools view -F 20 -u | bamToPsl - stdout | pslCheck stdin -pass=stdout -quiet 2> gencode.<number>.out | sort -k 14 | pslRecalcMatch stdin gencode.2bit seqlist_<number>.fa stdout | sort -k 10 > gencode.<number>.psl
```
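The two `sort` steps in this pipeline use keys starting at PSL column 14 (`tName`, the target name) and column 10 (`qName`, the query name). As a reading aid, the sketch below splits a PSL line into named fields. The column names follow the UCSC PSL format; the example line and the helper `parse_psl_line` are made up for illustration.

```python
PSL_COLUMNS = (
    "matches", "misMatches", "repMatches", "nCount",
    "qNumInsert", "qBaseInsert", "tNumInsert", "tBaseInsert",
    "strand", "qName", "qSize", "qStart", "qEnd",
    "tName", "tSize", "tStart", "tEnd",
    "blockCount", "blockSizes", "qStarts", "tStarts",
)


def parse_psl_line(line):
    """Return a dict mapping PSL column names to their string values."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(PSL_COLUMNS):
        raise ValueError(f"expected {len(PSL_COLUMNS)} columns, got {len(fields)}")
    return dict(zip(PSL_COLUMNS, fields))


# Made-up alignment of a 25-nucleotide read to a 1500-nucleotide transcript.
line = "\t".join([
    "25", "0", "0", "0", "0", "0", "0", "0", "+",
    "read_1", "25", "0", "25",
    "transcript_1", "1500", "100", "125",
    "1", "25,", "0,", "100,",
])
record = parse_psl_line(line)
print(record["qName"], record["tName"])  # read_1 transcript_1
```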
To combine the results, use
```
python merge_filtered.py gencode
```
generating the merged file `gencode.psl`.
Remove the intermediate files:
```
rm gencode.2bit
rm gencode.*.out
rm script_gencode_*.sh
rm script_gencode_*.stdout
rm script_gencode_*.stderr
```
Convert the transcript coordinates to genomic coordinates:
```
target=gencode
echo -e '@HD\tVN:1.6' | samtools view -H -t /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes | grep -v _alt > $target.sam
pslMap -mapInfo=$target.info $target.psl /osc-fs_home/mdehoon/Data/CASPARs/Filters/$target.psl stdout | psl2sam.pl >> $target.sam
samtools view -hb $target.sam -o $target.bam
python add_targets.py $target
rm $target.sam
rm $target.info
```

Remove the mapped sequences from the Fasta files:
```
python make_removal_scripts.py gencode 100
bash -x script.sh
```
This will run
```
python remove_mapped_sequences.py gencode <number>
```
which rewrites the `seqlist_<number>.fa` files, retaining the unmapped sequences only. Check the output files:
```
cat script_gencode_*.stdout | grep -c Done
cat script_gencode_*.stderr
```
If there were no errors, we can remove the temporary files:
```
rm script_gencode_*.sh
rm script_gencode_*.stdout
rm script_gencode_*.stderr
```
Store the mapping results:
```
gzip gencode.psl
mv gencode.psl.gz /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/PSL
mv gencode.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/BAM
```
Remove the intermediate files:
```
rm gencode.*.psl
```

## 5.4 FANTOM-CAT

Create a .2bit file with the FANTOM-CAT transcript sequences:
```
faToTwoBit /osc-fs_home/mdehoon/Data/CASPARs/Filters/fantomcat.fa fantomcat.2bit
```
Use BWA to map all HiSeq sequences to the FANTOM-CAT transcripts:
```
python make_alignment_scripts.py fantomcat
bash -x script.sh
```
Each generated script runs the following pipeline on one `seqlist_<number>.fa` file, mapping its sequences to the FANTOM-CAT transcripts:
```
bwa mem -O 0 -E 1 -A 1 -B 1 -T 10 -k 10 -c 100000000 -a -Y /osc-fs_home/mdehoon/Data/CASPARs/Filters/fantomcat.fa seqlist_<number>.fa | samtools view -F 20 -u | bamToPsl - stdout | pslCheck stdin -pass=stdout -quiet 2> fantomcat.<number>.out | sort -k 14 | pslRecalcMatch stdin fantomcat.2bit seqlist_<number>.fa stdout | sort -k 10 > fantomcat.<number>.psl
```
To combine the results, use
```
python merge_filtered.py fantomcat
```
generating the merged file `fantomcat.psl`.
Remove the intermediate files:
```
rm fantomcat.2bit
rm fantomcat.*.out
rm script_fantomcat_*.sh
rm script_fantomcat_*.stdout
rm script_fantomcat_*.stderr
```
Convert the transcript coordinates to genomic coordinates:
```
target=fantomcat
echo -e '@HD\tVN:1.6' | samtools view -H -t /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes | grep -v _alt > $target.sam
pslMap -mapInfo=$target.info $target.psl /osc-fs_home/mdehoon/Data/CASPARs/Filters/$target.psl stdout | psl2sam.pl >> $target.sam
samtools view -hb $target.sam -o $target.bam
python add_targets.py $target
rm $target.sam
rm $target.info
```

Remove the mapped sequences from the Fasta files:
```
python make_removal_scripts.py fantomcat 100
bash -x script.sh
```
This will run
```
python remove_mapped_sequences.py fantomcat <number>
```
which rewrites the `seqlist_<number>.fa` files, retaining the unmapped sequences only. Check the output files:
```
cat script_fantomcat_*.stdout | grep -c Done
cat script_fantomcat_*.stderr
```
If there were no errors, we can remove the temporary files:
```
rm script_fantomcat_*.sh
rm script_fantomcat_*.stdout
rm script_fantomcat_*.stderr
```
Store the mapping results:
```
gzip fantomcat.psl
mv fantomcat.psl.gz /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/PSL
mv fantomcat.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/BAM
```
Remove the intermediate files:
```
rm fantomcat.*.psl
```

# 6. Perform alignments to the genome using BWA

Create a list of chromosomes, excluding the haplotype sequences with names ending in `_alt`:
```
twoBitInfo /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.2bit stdout | cut -f 1 | grep -v _alt > seqList
```
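The filter above can be sketched in Python as follows. Note that `grep -v _alt` drops every name that contains `_alt`, not only names that end in it; the sketch mirrors that behaviour. The helper `primary_sequences` is hypothetical, and the names are examples in the style of the hg38 assembly.

```python
def primary_sequences(names):
    """Return the sequence names that do not contain '_alt'."""
    return [name for name in names if "_alt" not in name]


names = [
    "chr1",
    "chr1_KI270706v1_random",
    "chr1_KI270762v1_alt",
    "chrUn_KI270302v1",
    "chrM",
]
print(primary_sequences(names))
# ['chr1', 'chr1_KI270706v1_random', 'chrUn_KI270302v1', 'chrM']
```

Unplaced and unlocalized sequences (`chrUn_*` and `*_random`) are kept; only the alternate haplotypes are excluded.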
Create a Fasta file for these chromosomes:
```
twoBitToFa -noMask -seqList=seqList /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.2bit hg38.fa
```
Create the corresponding twoBit file:
```
faToTwoBit hg38.fa hg38.2bit
```
Create an index of the genome for BWA:
```
bwa index hg38.fa
```
Use BWA to map all remaining HiSeq sequences to the genome:
```
python make_alignment_scripts.py genome
bash -x script.sh
```
Each generated script runs the following pipeline on one `seqlist_<number>.fa` file, mapping its sequences to the genome:
```
bwa mem -O 0 -E 1 -A 1 -B 1 -T 10 -k 10 -c 100000000 -a -Y hg38.fa seqlist_<number>.fa | samtools view -F 4 -u | bamToPsl - stdout | pslCheck stdin -pass=stdout -quiet 2> genome.<number>.out | sort -k 14 | pslRecalcMatch stdin hg38.2bit seqlist_<number>.fa stdout | sort -k 10 > genome.<number>.psl
```
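Note that this command filters with `samtools view -F 4`, whereas the transcript alignments in section 5 use `-F 20`. The `-F` option discards any alignment with at least one of the given flag bits set: 4 is the unmapped bit, and 20 = 4 + 16 adds the reverse-strand bit. The transcript alignments therefore keep forward-strand hits only, while the genome alignments keep hits on both strands. The sketch below mimics that test; `passes_filter` is a hypothetical helper.

```python
UNMAPPED = 0x4         # read is unmapped
REVERSE_STRAND = 0x10  # read aligned to the reverse strand


def passes_filter(flag, exclude):
    """Mimic `samtools view -F exclude`: keep if no excluded bit is set."""
    return (flag & exclude) == 0


for flag in (0, 4, 16, 256, 272):
    # Columns: flag, kept by -F 20, kept by -F 4
    print(flag, passes_filter(flag, UNMAPPED | REVERSE_STRAND), passes_filter(flag, UNMAPPED))
```

Flag 16 (a reverse-strand hit) and flag 272 (a secondary reverse-strand hit) pass `-F 4` but not `-F 20`.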

To combine the results, use
```
python merge_filtered.py genome
```
Check the generated file `genome.psl`:
```
pslCheck genome.psl
```
No errors should be reported.

Remove the intermediate files:
```
rm genome.*.out
rm script_genome_*.sh
rm script_genome_*.stdout
rm script_genome_*.stderr
```

Convert the `.psl` file to a `.bam` file:

```
target=genome
echo -e '@HD\tVN:1.6' | samtools view -H -t /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes | grep -v _alt > $target.sam
psl2sam.pl $target.psl >> $target.sam
samtools view -hb -o $target.bam $target.sam
python add_targets.py $target
rm $target.sam
```

Remove the mapped sequences from the Fasta files:
```
python make_removal_scripts.py genome 100
bash -x script.sh
```
This will run
```
python remove_mapped_sequences.py genome <number>
```
which rewrites the `seqlist_<number>.fa` files, retaining the unmapped sequences only. Check the output files:
```
cat script_genome_*.stdout | grep -c Done
cat script_genome_*.stderr
```
If there were no errors, we can remove the temporary files:
```
rm script_genome_*.sh
rm script_genome_*.stdout
rm script_genome_*.stderr
```
Store the mapping results:
```
gzip genome.psl
mv genome.psl.gz /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/PSL
mv genome.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/BAM
```
Remove the intermediate files:
```
rm hg38.fa
rm hg38.fa.amb
rm hg38.fa.ann
rm hg38.fa.bwt
rm hg38.fa.pac
rm hg38.fa.sa
rm hg38.2bit
rm seqList
rm genome.*.psl
```

# 7. Merge the BWA results

Merge the mapping results for the unique sequences:
```
python mergebam.py
```
generating the file `seqlist.bam`.

# 8. Add the sequence data to the BAM file

Add the query sequences in `seqlist.fa` to the `seqlist.bam` BAM file:
```
python add_sequences.py
```
generating a new `seqlist.bam` file. Check that the BAM file can be read from start to end without errors:
```
samtools view -h seqlist.bam > /dev/null
```

# 9. Write a separate BAM file for each library

Write a separate `.bam` file for each library by combining its index file `<library>.index.txt` with the `seqlist.bam` file:
```
python make_split_bam_scripts.py
bash -x script.sh
```
which runs
```
python split_bam.py <library>
```
for each library, generating a new file `<library>.bam`.
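Conceptually, this step expands the alignments of the unique sequences back to the individual reads of a library. The sketch below shows one way to do that with plain Python data structures. The format of the index file and the logic of `split_bam.py` are not shown in this document, so the `(read_name, sequence_name)` pairs, the alignment strings and the helper `split_alignments` are all assumptions.

```python
from collections import defaultdict


def split_alignments(index, alignments):
    """Yield (read_name, alignment) pairs for one library.

    index: iterable of (read_name, sequence_name) pairs for the library.
    alignments: iterable of (sequence_name, alignment) pairs for the
        unique sequences; one sequence may have several alignments.
    """
    by_sequence = defaultdict(list)
    for sequence_name, alignment in alignments:
        by_sequence[sequence_name].append(alignment)
    for read_name, sequence_name in index:
        # Reads whose unique sequence has no alignment are skipped here.
        for alignment in by_sequence.get(sequence_name, []):
            yield read_name, alignment


index = [("read_1", "seq_1"), ("read_2", "seq_2"), ("read_3", "seq_1")]
alignments = [("seq_1", "chr1:100"), ("seq_1", "chr2:500"), ("seq_3", "chrX:42")]
for read_name, alignment in split_alignments(index, alignments):
    print(read_name, alignment)
```

In this example `read_1` and `read_3` share the unique sequence `seq_1` and each receive both of its alignments, while `read_2` has no alignment and produces no output.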

Move the `.bam` files to `/osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/`:
```
mv ???_r?.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/
```

Remove the intermediate files:
```
rm script*.sh
rm script_*.stderr
rm script_*.stdout
rm seqlist.bam
```

Remove the remaining Fasta files and the index files:
```
rm seqlist.fa
rm skipped.fa
rm seqlist_*.fa
rm *.index.txt
```
